Abstract

The boom of social networks has given rise to a large volume of user-generated contents (UGCs), most of which are freely and publicly available. The potential of using this rich set of UGCs to study people's personal attributes and to support personalized applications has been widely validated. Despite their value, UGCs can also expose users to high privacy risks, a problem that thus far remains largely unexplored. Privacy is defined as the individual's ability to control what information is disclosed, to whom, when and under what circumstances. As people and information both play significant roles, privacy has been described as a boundary regulation process, in which individuals regulate their interaction with others by altering how open they are to others. In this paper, we aim to reduce users' privacy risks on social networks by answering the question of Who Can See What. Towards this goal, we propose a novel scheme to tackle the problem of boundary regulation, comprising descriptive, predictive and prescriptive components, as shown in Fig. 1. In particular, we first collect a set of posts and extract a rich set of privacy-oriented features to describe them. We then propose a novel taxonomy-guided multi-task learning model to identify which personal aspects are revealed by a given post. Finally, we construct standard guidelines, based on responses from 400 users, to regulate users' actions and prevent privacy leakage. Extensive experiments on a real-world dataset validate our scheme. We have released the data, code and parameters to facilitate research in the community.

Framework

Fig. 1. Illustration of the proposed scheme for boundary regulation. In the first component, we build a comprehensive taxonomy of personal aspects, collect a benchmark dataset from Twitter and extract a rich set of features to describe the UGCs. The second component presents a taxonomy-constrained model to detect whether a given post leaks certain personal aspects. In the last component, following the guidelines built via AMT, we suggest to users what they should do.



Data Collection

To build our benchmark dataset, we collected social posts for each category in the pre-defined taxonomy using the Twitter search service. In this way, we obtained 269,090 raw tweets.



Features

| Feature name | Description | Link |
|---|---|---|
| LIWC | LIWC, short for Linguistic Inquiry and Word Count, is a transparent psycholinguistic lexicon analysis tool. We adopted the LIWC features to capture the sensitivity of a given UGC. | Data |
| Sentence2Vector | We treated each tweet as a sentence and used the Sentence2Vec tool to generate a fixed-dimension (100) vector representation of each tweet. | Data |
| Metadata | We extracted several metadata features, such as the presence of hashtags, images, emojis and user mentions. We also incorporated the timestamp as an important feature. | Data |
| Privacy Dictionary | The privacy dictionary is a linguistic resource for automated content analysis of privacy-related texts. With the help of this dictionary, we generate a 9-dimensional feature vector. | Data |
| Sentiment | We used the Stanford NLP sentiment classifier to judge each tweet's polarity. Each tweet is assigned a value from 0 to 4, corresponding to very negative, negative, neutral, positive and very positive. | Data |
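As an illustration of the Metadata row, the following is a minimal sketch of how such presence features might be computed for a single tweet. It is not the released feature extractor: the function name `metadata_features`, the regular expressions, the rough emoji ranges, and the choice to encode the timestamp as the hour of day are all assumptions made for this example.

```python
import re
from datetime import datetime, timezone

HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji code-point ranges; an assumption, not an exhaustive emoji test.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def metadata_features(text, has_image, created_at):
    """Return binary presence features plus a simple timestamp feature.

    Hypothetical helper: the feature names and the hour-of-day encoding
    are illustrative choices, not taken from the original work.
    """
    return {
        "has_hashtag": int(bool(HASHTAG_RE.search(text))),
        "has_mention": int(bool(MENTION_RE.search(text))),
        "has_emoji": int(bool(EMOJI_RE.search(text))),
        "has_image": int(has_image),
        "hour_of_day": created_at.hour,
    }


example = metadata_features(
    "Finally graduated today! #classof2017 @alice \U0001F393",
    has_image=True,
    created_at=datetime(2017, 6, 1, 15, 30, tzinfo=timezone.utc),
)
print(example)
```

In practice, whether a tweet carries an image would come from the tweet's own metadata rather than from the text; here it is passed in as a plain boolean to keep the sketch self-contained.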


Ground Truth

We constructed the ground truth about what is revealed by a given post via Amazon Mechanical Turk (AMT). Using a majority voting strategy to establish the final labels of each post, we obtained 11,368 labeled posts. The ground truth can be accessed via this link.
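The majority voting step can be sketched as follows. This is only an illustration under stated assumptions: each annotator is assumed to give one label per post, a label is accepted only when more than half of the votes agree, and posts without such a majority are dropped. The function name `majority_label` and the sample annotations are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Strict majority: more than half of all votes (assumed rule).
    return label if count > len(votes) / 2 else None


# Hypothetical annotations: post id -> labels given by three annotators.
annotations = {
    "post_1": ["health conditions", "health conditions", "treatments"],
    "post_2": ["age", "birthday", "salary"],
}

labels = {post_id: majority_label(v) for post_id, v in annotations.items()}
# Keep only posts on which the annotators reached a majority.
labeled = {post_id: lab for post_id, lab in labels.items() if lab is not None}
print(labeled)
```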

Guidelines

We conducted a user study via AMT to build guidelines regarding the disclosure norms of different social circles. To account for cultural differences, we launched a cross-cultural survey in two distinct regions: the U.S. and Asia. The complete results are listed below.


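| Aspects | U.S. family | U.S. close | U.S. casual | U.S. outside | Asia family | Asia close | Asia casual | Asia outside |
|---|---|---|---|---|---|---|---|---|
| family/association | 194 | 180 | 101 | 40 | 172 | 130 | 58 | 16 |
| home address | 191 | 142 | 10 | 6 | 161 | 146 | 36 | 12 |
| negative emotion | 177 | 187 | 119 | 73 | 98 | 155 | 62 | 40 |
| contact | 191 | 174 | 37 | 6 | 155 | 161 | 55 | 21 |
| passing away | 191 | 173 | 70 | 24 | 176 | 149 | 62 | 15 |
| relationship status | 189 | 192 | 152 | 104 | 146 | 170 | 104 | 63 |
| current location | 189 | 175 | 63 | 18 | 150 | 155 | 70 | 23 |
| religion | 186 | 173 | 118 | 83 | 146 | 148 | 92 | 60 |
| self promotion | 196 | 193 | 188 | 171 | 150 | 162 | 141 | 131 |
| health conditions | 196 | 142 | 35 | 14 | 170 | 130 | 39 | 15 |
| specific complaints | 107 | 156 | 56 | 35 | 73 | 136 | 56 | 38 |
| treatments | 192 | 153 | 37 | 10 | 176 | 131 | 29 | 15 |
| friendship | 181 | 194 | 151 | 56 | 119 | 187 | 110 | 40 |
| places planning to go | 190 | 183 | 102 | 43 | 154 | 174 | 78 | 27 |
| age | 197 | 192 | 149 | 80 | 179 | 158 | 76 | 32 |
| activities outside of home and work | 188 | 189 | 123 | 42 | 132 | 162 | 95 | 44 |
| employer (company) | 187 | 181 | 111 | 45 | 140 | 160 | 107 | 61 |
| drug/alcohol | 101 | 171 | 64 | 17 | 48 | 152 | 40 | 19 |
| graduation | 195 | 190 | 160 | 86 | 158 | 170 | 120 | 75 |
| birthday | 186 | 188 | 146 | 66 | 165 | 169 | 100 | 39 |
| full name | 194 | 179 | 94 | 35 | 153 | 156 | 88 | 52 |
| relationship status change | 179 | 192 | 102 | 38 | 152 | 167 | 70 | 23 |
| have babies | 191 | 193 | 127 | 49 | 180 | 156 | 81 | 26 |
| activities at home | 190 | 186 | 123 | 70 | 158 | 137 | 67 | 26 |
| gender | 191 | 192 | 169 | 127 | 151 | 153 | 106 | 65 |
| activities at work | 188 | 184 | 111 | 52 | 118 | 165 | 96 | 33 |
| education | 190 | 187 | 161 | 118 | 155 | 158 | 150 | 100 |
| occupation | 187 | 181 | 133 | 81 | 142 | 151 | 116 | 69 |
| salary | 176 | 86 | 20 | 20 | 143 | 114 | 40 | 29 |
| career promotion | 193 | 189 | 125 | 56 | 167 | 168 | 91 | 41 |
| general complaints | 184 | 188 | 167 | 119 | 134 | 158 | 104 | 64 |
| positive emotion | 190 | 195 | 166 | 108 | 151 | 173 | 89 | 42 |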
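
The survey results above could be used as a simple lookup when suggesting who should see a post. The sketch below is only one possible reading, not the prescriptive component itself: it assumes each cell counts the respondents who found it acceptable to disclose the aspect to that circle, and it uses a hypothetical cut-off `min_count` (100 here) that is not taken from the original work. Only three rows of the table are copied in, and the names `GUIDELINES` and `acceptable_circles` are illustrative.

```python
CIRCLES = ("family", "close", "casual", "outside")

# A few rows copied from the survey table: counts per circle, per region.
GUIDELINES = {
    "salary": {"US": (176, 86, 20, 20), "Asia": (143, 114, 40, 29)},
    "health conditions": {"US": (196, 142, 35, 14), "Asia": (170, 130, 39, 15)},
    "self promotion": {"US": (196, 193, 188, 171), "Asia": (150, 162, 141, 131)},
}


def acceptable_circles(aspect, region, min_count=100):
    """List the circles whose count reaches min_count (a hypothetical cut-off)."""
    counts = GUIDELINES[aspect][region]
    return [circle for circle, n in zip(CIRCLES, counts) if n >= min_count]


print(acceptable_circles("salary", "US"))
print(acceptable_circles("salary", "Asia"))
print(acceptable_circles("self promotion", "Asia"))
```

Given a post that the predictive component flags as revealing, say, salary, a lookup like this could suggest restricting its audience to the returned circles; the right threshold would need to be chosen with care, since it changes the suggestion directly.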